Data Wrangling

Lennart Kasserra

CorrelAid

2024-04-22

Setup

library(dplyr)
library(tidyr)

bigfoot <- readr::read_csv(here::here("data/bigfoot.csv"))

Motivation

So far we have taken the data set at face value; but if we explore we may find some issues. For example: what is “Unknown” for the season?

bigfoot |> count(season)
# A tibble: 5 × 2
  season      n
  <chr>   <int>
1 Fall     1492
2 Spring    825
3 Summer   1867
4 Unknown    92
5 Winter    745

Nonsensical records & missing values

The easiest way of dealing with this would be to convert season to NA if it is “Unknown”:

bigfoot |> 
  mutate(season = season |> na_if("Unknown")) |> 
  count(season)
# A tibble: 5 × 2
  season     n
  <chr>  <int>
1 Fall    1492
2 Spring   825
3 Summer  1867
4 Winter   745
5 <NA>      92

But it seems for some of the unknown we actually have a date:

bigfoot |> 
  filter(season == "Unknown", !is.na(date)) |> 
  select(date, season)
# A tibble: 64 × 2
   date       season 
   <date>     <chr>  
 1 1971-10-01 Unknown
 2 1979-07-18 Unknown
 3 1980-06-30 Unknown
 4 1957-09-18 Unknown
 5 1992-01-15 Unknown
 6 1997-05-01 Unknown
 7 1956-01-01 Unknown
 8 1962-06-06 Unknown
 9 1977-01-01 Unknown
10 1986-06-01 Unknown
# ℹ 54 more rows

Nonsensical records & missing values: imputation

Let’s break the problem down (one way):

  • Imputing season from date:
    1. Extract month from date
    2. Set season to e.g. fall if month is october or november, june to september would be summer, etc. etc.

…but how?

Excourse: vectorized control flow

Generally, most languages - including R - have some form of “control flow” logic:

if (condition) {
  do_thing()
} else {
  do_other_thing()
}

Excourse: vectorized control flow

Problem: if (condition) can only check one value at a time:

if (bigfoot$season == "Unknown") print("This one!")
Error in if (bigfoot$season == "Unknown") print("This one!"): Bedingung hat Länge > 1

Excourse: vectorized control flow

dplyr provides an if_else() that is vectorized (no need to write loops or similar), which would look like:

if (condition) {
  do_thing()
} else {
  do_other_thing()
}
if_else(condition, 
  do_thing(), 
  do_other_thing()
)
  • if_else() automatically handles missing values, while base R’s ifelse() does not! You can also determine how they should be handled by setting the missing-argument.

Excourse: vectorized control flow

If condition is TRUE, it will execute true, else it executes false:

bigfoot |> mutate(reliable = if_else(classification == "Class A", 1, 0))

…will create a “dummy”-vector that is 1 if classification == "Class A" and 0 otherwise (for “unreliable” sightings).

Excourse: vectorized control flow

But control flow may become more complex:

if (condition) {
  do_thing()
} else if (other_condition) {
  do_other_thing()
} else {
  do_something_else()
}

Excourse: vectorized control flow

The solution is case_when():

if (condition) {
  do_thing()
} else if (other_condition) {
  do_other_thing()
} else {
  do_something_else()
}
case_when(
  condition ~ do_thing(),
  other_condition ~ do_other_thing(),
  .default = do_something_else()
)

Nonsensical records & missing values: imputation

Back to our problem:

bigfoot |> 
  mutate(
    # First add the month:
    month = lubridate::month(date),
    # Then impute:
    season = case_when(
      season != "Unknown" ~ season, # if season not unkown just keep it
      month %in% c(3:5) ~ "Spring", # otherwise step through these...
      month %in% c(6:9) ~ "Summer",
      month %in% c(10:11) ~ "Fall",
      month %in% c(12, 1, 2) ~ "Winter",
      .default = NA                 # ...and to this if no condition matches.
    )
  )

Handling missing values

Drop all missing values:

bigfoot |> drop_na()

Drop obs. where a certain variable is missing:

bigfoot |> filter(!is.na(season))

Drop obs. where a of a number of variables is missing:

bigfoot |> filter(!if_any(c(season, starts_with("temperature")), is.na))

i.e. don’t keep row if_any() of these columns are missing (NA)! You can also reverse this:

bigfoot |> filter(if_all(c(season, starts_with("temperature")), \(x) !is.na(x)))

i.e. only keep row if_all() of these variables have observations (are not missing)